This document is the summary of the R for Data Analysis workshop.
All correspondence related to this document should be addressed to:
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)
Email: omidreza.ghasemi@hdr.mq.edu.au
Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations
R can be used as a calculator. Keep operator precedence in mind: R evaluates ^ first, then * and /, then + and -, and parentheses override that order.
10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50
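The order of operations can be checked directly in the console. The lines below are a small sketch (the variable names are only illustrative) showing how parentheses change the result:

```r
# Exponentiation is evaluated before multiplication, and multiplication before addition
no_parens <- 2 + 3 * 4 ^ 2          # same as 2 + (3 * 16)
no_parens
## [1] 50

# Parentheses override the default order
with_parens <- ((2 + 3) * 4) ^ 2    # same as (5 * 4) ^ 2
with_parens
## [1] 400

# The minus sign is applied after ^, so this is -(2 ^ 2)
neg_square <- -2 ^ 2
neg_square
## [1] -4
```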
R is flexible about spacing around operators, but spaces are not allowed inside the names of variables or functions.
10+10
## [1] 20
10 + 10
## [1] 20
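R can sometimes tell that you are not finished yet. After an incomplete command like the one below, the console shows a + prompt and waits for the rest of the expression: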
10 +
How do we create a variable? Assign a value to a name with <- or =. Note that R is case sensitive: pay and Pay are two different names.
pay <- 250
month = 12
pay * month
## [1] 3000
salary <- pay * month
A few points on naming variables and vectors: use short, informative words and stay consistent with one style (e.g., capital letters are allowed but not recommended; separate words with only _ or .).
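As a quick illustration of case sensitivity and consistent naming, here is a minimal sketch. The names Pay, monthly_pay, months_worked, and yearly_salary are made up for this example and are not part of the workshop material:

```r
# Names that differ only in case refer to different objects
pay <- 250
Pay <- 300
pay == Pay
## [1] FALSE

# One possible convention: lowercase words separated by underscores
monthly_pay <- 250
months_worked <- 12
yearly_salary <- monthly_pay * months_worked
yearly_salary
## [1] 3000
```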
A function is a set of statements combined to perform a specific task. When we use a block of code repeatedly, we can convert it into a function. To write a function, you first need to define it:
my_multiplier <- function(a, b) {
  result <- a * b
  return(result)
}
Defining the function does not produce any output. To get a result, you need to call it:
my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)
We can set a default value for our arguments:
my_multiplier2 <- function(a, b = 4) {
  result <- a * b
  return(result)
}
my_multiplier2 (a=2)
## [1] 8
# or: my_multiplier2(2)
# to override the default: my_multiplier2(2, 6), which returns 12
Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58
Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.
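Your own functions can call built-in ones. The sketch below defines a hypothetical helper, percent_of() (not part of R or of the workshop), that combines a default argument with round():

```r
# Hypothetical helper: what percentage of `whole` is `part`?
# `digits` has a default value, so callers can leave it out
percent_of <- function(part, whole, digits = 1) {
  result <- (part / whole) * 100
  return(round(result, digits = digits))
}

percent_of(250, 500)               # uses the default digits = 1
## [1] 50
percent_of(1, 3)
## [1] 33.3
percent_of(1, 3, digits = 3)       # overrides the default
## [1] 33.333
percent_of(whole = 3, part = 1)    # named arguments can come in any order
## [1] 33.3
```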
The function class() shows the type of a variable.
TRUE and FALSE can be abbreviated as T and F. They have to be in capital letters; ‘true’ is not a logical value:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"
Can we change the type of data in a variable? Yes, use one of the as.*() functions, such as as.numeric() or as.character():
as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA
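The conversions above can be combined with the is.*() family of functions, which test a type without changing it. A short sketch (the variable name converted is just for illustration):

```r
# Logical values become 1 and 0 when converted to numbers
as.numeric(c(TRUE, FALSE, TRUE))
## [1] 1 0 1

# sum() does this conversion for us, so it counts the TRUE values
sum(c(TRUE, FALSE, TRUE))
## [1] 2

# is.numeric() tests the type without changing it
is.numeric("4.5")
## [1] FALSE
is.numeric(as.numeric("4.5"))
## [1] TRUE

# Failed conversions give NA; is.na() lets us find them afterwards
converted <- suppressWarnings(as.numeric(c("4.5", "Hello", "10")))
is.na(converted)
## [1] FALSE  TRUE FALSE
```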
A vector stores more than one number or string in a single variable. Use the combine function c() to create one.
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)
sale <- c(1:10)
sale * sale
## [1] 1 4 9 16 25 36 49 64 81 100
Subsetting a vector:
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
days[c(2, 3, 4)]
## [1] "Sunday" "Monday" "Tuesday"
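Besides single positions, brackets also accept ranges, several negative positions, and logical conditions. The following is a minimal sketch of those variations; the names first_three, without_first_two, and long_names are only illustrative:

```r
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# A range of positions
first_three <- days[1:3]
first_three
## [1] "Saturday" "Sunday"   "Monday"

# Drop several positions at once
without_first_two <- days[-c(1, 2)]
length(without_first_two)
## [1] 5

# Logical condition: keep the names with more than 7 characters
long_names <- days[nchar(days) > 7]
long_names
## [1] "Saturday"  "Wednesday" "Thursday"
```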
Create a vector named my_vector with the numbers from 0 to 1000 and calculate the mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)
mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1] 0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081
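The output above uses range() to get the minimum and maximum together. As a small complement, the sketch below calls max() on its own and checks the mean by hand (manual_mean is an illustrative name):

```r
my_vector <- 0:1000

max(my_vector)
## [1] 1000

# 1001 elements, because 0 is included
length(my_vector)
## [1] 1001

# The mean is the sum divided by the number of elements
manual_mean <- sum(my_vector) / length(my_vector)
manual_mean
## [1] 500

# summary() reports several of these statistics in one call
summary(my_vector)
```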
A list allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, or even other lists.
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4 5 6 7
##
## [[5]]
## [1] "HELLO"
##
## [[6]]
## [1] "hello"
##
## [[7]]
## [1] FALSE
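The [[n]] markers in the printed list hint at how elements are retrieved. A brief sketch of list access follows; the names used in named_list are made up for illustration:

```r
sale <- 1:10
my_list <- list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)

# Double brackets return the element itself
fourth <- my_list[[4]]
fourth
## [1] 4 5 6 7

# Single brackets return a shorter list that contains the element
sub_list <- my_list[4]
class(sub_list)
## [1] "list"

# Elements can be given names and then accessed with $
named_list <- list(sales = sale, greeting = "hello")
named_list$greeting
## [1] "hello"
```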
Factors store a vector together with its distinct values, which serve as labels (levels). The labels are always stored as character strings, whether the original values were numeric or character. For example, a variable gender with “male” and “female” entries:
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
R now treats gender as a nominal (categorical) variable and codes its levels internally as integers in alphabetical order (with clean data: 1 = female, 2 = male).
summary(gender)
## female female male
## 1 2 3
gender
## [1] male male male female female female
## Levels: female female male
So, be careful of spaces! The stray leading space in " female" made R treat it as a third level, separate from "female".
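One way to guard against this kind of problem is to strip whitespace before creating the factor. The sketch below uses base R's trimws(); the names gender_raw and gender_clean are only for illustration:

```r
gender_raw <- c("male", "male", "male", " female", "female", "female")

# The stray space produces three levels instead of two
nlevels(factor(gender_raw))
## [1] 3

# trimws() removes leading and trailing whitespace before the conversion
gender_clean <- factor(trimws(gender_raw))
levels(gender_clean)
## [1] "female" "male"
summary(gender_clean)
```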
Let’s create a larger gender variable with 30 male and 40 female entries (using the rep() function):
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
## [1] male male male male male male male male male male
## [11] male male male male male male male male male male
## [21] male male male male male male male male male male
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male
There are two types of categorical variables: nominal and ordinal. How do we create ordered factors (when the variable is ordinal, so its values can be ranked)? We add the argument ordered = TRUE to the factor() function and specify the order of the levels, either with levels = c("level1", "level2") or, as below, by assigning the level names with levels(). For example, we have a vector that shows participants’ education level.
edu<-c(3,2,3,4,1,2,2,3,4)
education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College high school College Uni graduated Primary school
## [6] high school high school College Uni graduated
## Levels: Primary school < high school < College < Uni graduated
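The same ordered factor can also be built in a single call by passing levels and labels to factor() directly. This is a sketch of that alternative, followed by a comparison that only works because the factor is ordered:

```r
edu <- c(3, 2, 3, 4, 1, 2, 2, 3, 4)

# levels = the values as stored, labels = the names to display
education <- factor(edu,
                    levels = c(1, 2, 3, 4),
                    labels = c("Primary school", "high school", "College", "Uni graduated"),
                    ordered = TRUE)

# Ordered factors support comparisons between levels
above_high_school <- education > "high school"
above_high_school
## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE

# Count the participants at each level
table(education)
```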
Suppose a factor named health_status holds patient and control values. Here, the first level is control and the second level is patient. Change the order of the levels so that patient is the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
## [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control
Finally, can you relabel both levels so that they start with an uppercase letter? (Hint: check ?factor)
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
## [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control
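To see what reordering actually changes, it can help to look at the integer codes behind the factor. The sketch below also shows relevel(), a base R alternative that moves a single level to the front (health_status_releveled is an illustrative name):

```r
health_status <- factor(c(rep("patient", 5), rep("control", 5)))

# Integer codes follow the level order: control = 1, patient = 2
as.integer(health_status)
## [1] 2 2 2 2 2 1 1 1 1 1

# relevel() moves one level to the first position
health_status_releveled <- relevel(health_status, ref = "patient")
levels(health_status_releveled)
## [1] "patient" "control"

# The values are unchanged; only the internal coding differs
as.integer(health_status_releveled)
## [1] 1 1 1 1 1 2 2 2 2 2
```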
All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. A matrix can be created by passing a vector to the matrix() function, which fills the matrix column by column by default.
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)
my_matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
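The fill order and the bracket notation for matrices can be explored with a short sketch. The names by_col and by_row are illustrative:

```r
# By default matrix() fills column by column; byrow = TRUE fills row by row
by_col <- matrix(1:9, nrow = 3, ncol = 3)
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# [row, column]
by_col[1, 2]
## [1] 4
by_row[1, 2]
## [1] 2

# Leaving the column empty selects the whole row
by_row[2, ]
## [1] 4 5 6

# Number of rows and columns
dim(by_row)
## [1] 3 3

# Mixing types forces every element to a single mode (character here)
mixed <- matrix(c(1, "a", TRUE, 4), nrow = 2)
class(mixed[1, 1])
## [1] "character"
```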
Data frames can hold numeric, character, or logical values. Within a column all elements have the same data type, but different columns can have different data types. Let’s create a dataframe:
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
              rnorm(100, mean = 25, sd = 5))
my_dataframe <- data.frame(Patient = id,
                           Treatment = group,
                           Response = response)
Equivalently, we could have created everything in a single call:
my_dataframe <- data.frame(Patient = c(1:200),
                           Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))
In large data sets, the function head() shows the first observations of a data frame. Similarly, the function tail() prints the last observations in your data set.
head(my_dataframe)
tail(my_dataframe)
| | Patient | Treatment | Response |
|---|---|---|---|
| 1 | 1 | Psychotherapy | 19.45338 |
| 2 | 2 | Psychotherapy | 42.34821 |
| 3 | 3 | Psychotherapy | 35.61285 |
| 4 | 4 | Psychotherapy | 25.17399 |
| 5 | 5 | Psychotherapy | 24.42101 |
| 6 | 6 | Psychotherapy | 27.48738 |
| | Patient | Treatment | Response |
|---|---|---|---|
| 195 | 195 | Medication | 27.49956 |
| 196 | 196 | Medication | 28.42176 |
| 197 | 197 | Medication | 28.92391 |
| 198 | 198 | Medication | 29.06204 |
| 199 | 199 | Medication | 27.61078 |
| 200 | 200 | Medication | 18.95131 |
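Both functions show six rows by default, and the n argument changes that. The sketch below rebuilds the data frame so that it runs on its own. The call to set.seed() is an addition (the seed value 123 is arbitrary) so that the random numbers are the same on every run, which means the values will differ from the tables above:

```r
# set.seed() makes the random numbers reproducible; 123 is an arbitrary choice
set.seed(123)
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))

first_rows <- head(my_dataframe, n = 3)   # first three rows only
last_rows <- tail(my_dataframe, n = 3)    # last three rows only
first_rows
last_rows

# Number of rows and columns
dim(my_dataframe)
## [1] 200   3

# Column names, types, and a preview of the values
str(my_dataframe)
```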
Similar to vectors and matrices, brackets [] are used to select data from the rows and columns of a data frame, in the form [row, column]:
my_dataframe[35, 3]
## [1] 31.15347
my_dataframe[1:10, ]
| Patient | Treatment | Response |
|---|---|---|
| 1 | Psychotherapy | 19.45338 |
| 2 | Psychotherapy | 42.34821 |
| 3 | Psychotherapy | 35.61285 |
| 4 | Psychotherapy | 25.17399 |
| 5 | Psychotherapy | 24.42101 |
| 6 | Psychotherapy | 27.48738 |
| 7 | Psychotherapy | 33.11080 |
| 8 | Psychotherapy | 42.36993 |
| 9 | Psychotherapy | 40.41996 |
| 10 | Psychotherapy | 23.41229 |
How to get only the Response column for all participants?
my_dataframe[ , 3]
## [1] 19.45338 42.34821 35.61285 25.17399 24.42101 27.48738 33.11080 42.36993
## [9] 40.41996 23.41229 28.24108 32.81045 25.70502 29.22300 32.27024 33.05726
## [17] 24.37990 31.37091 31.16245 28.76672 38.54995 22.87417 22.20308 23.74506
## [25] 25.12223 38.02695 28.44873 30.55478 36.07095 29.76718 37.62122 33.47683
## [33] 32.37533 31.40624 31.15347 34.65073 22.47355 27.59239 35.01169 36.09953
## [41] 29.29690 32.29629 31.98906 23.27326 27.44773 31.86654 17.90334 22.97421
## [49] 30.87948 29.53606 29.08800 30.08908 21.95537 35.25648 30.40872 30.87351
## [57] 23.57119 29.42636 32.90266 31.78335 35.53556 27.94127 28.05376 21.32260
## [65] 25.00423 19.52317 37.05042 29.95080 21.99706 26.90433 22.47090 31.14327
## [73] 40.54168 27.98746 28.14105 31.15181 30.93725 30.65365 23.85351 32.85104
## [81] 33.16058 32.15789 24.70513 24.71835 29.07822 29.34360 29.97579 24.19834
## [89] 27.12373 29.85477 27.23954 31.12225 31.72837 30.66292 28.12180 29.38940
## [97] 35.45928 35.13103 23.55602 21.51178 23.86999 24.08943 21.19567 23.76538
## [105] 31.23768 20.12949 23.35748 25.77560 19.97935 36.95775 33.83174 27.06497
## [113] 20.82996 16.90353 20.63556 22.35301 20.31938 28.19941 31.16086 14.80019
## [121] 30.02522 28.20146 24.90908 17.37278 30.53754 25.74269 24.67774 19.84581
## [129] 15.30539 24.30986 12.59414 21.31042 18.77066 25.84767 28.36952 18.72623
## [137] 26.18819 16.69716 22.39854 23.14091 24.53845 29.56114 18.33816 30.06442
## [145] 26.48275 21.59022 19.16012 27.35389 21.63158 18.79327 20.71436 20.73652
## [153] 30.98289 29.92019 21.84922 24.98820 25.63485 28.40387 28.10618 27.49502
## [161] 22.00292 31.93532 20.70760 26.42307 30.18167 15.52820 21.29023 19.23984
## [169] 19.23642 20.05807 25.03167 31.89290 26.70301 23.68235 20.84122 17.86026
## [177] 33.32782 30.41839 25.10713 23.05902 22.40902 27.36197 24.81034 22.85248
## [185] 22.74278 20.66957 22.41073 19.05380 26.37348 24.48729 30.65553 26.59533
## [193] 30.89649 20.92523 27.49956 28.42176 28.92391 29.06204 27.61078 18.95131
An easier way to select particular columns is to use their names, which is more practical than keeping track of column numbers in large data sets:
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response
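Column names combine naturally with logical conditions on the rows. The following sketch rebuilds the data frame with an arbitrary seed so that it runs on its own; the names medication, high_responders, and group_means, and the cut-off of 35, are only illustrative:

```r
set.seed(123)   # arbitrary seed, only so that reruns give the same numbers
my_dataframe <- data.frame(Patient = 1:200,
                           Treatment = rep(c("Psychotherapy", "Medication"), each = 100),
                           Response = c(rnorm(100, mean = 30, sd = 5),
                                        rnorm(100, mean = 25, sd = 5)))

# Rows are selected with a logical condition; an empty column slot keeps all columns
medication <- my_dataframe[my_dataframe$Treatment == "Medication", ]
nrow(medication)
## [1] 100

# A condition on the rows and names for the columns in the same call
high_responders <- my_dataframe[my_dataframe$Response > 35, c("Patient", "Response")]
head(high_responders)

# Mean response in each treatment group
group_means <- tapply(my_dataframe$Response, my_dataframe$Treatment, mean)
group_means
```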
So far, we have created dataframes using the data.frame() function from base R. However, a better way to create dataframes is to use the tibble() function from the tidyverse.